Translation Corpus Source and Size in Bilingual Retrieval
Abstract
This paper explores corpus-based bilingual retrieval where the translation corpora used vary by source and size. We find that the quality of translation alignments and the domain of the bitext are important. In some settings these factors are more critical than corpus size. We also show that judicious choice of tokenization can reduce the amount of bitext required to obtain good bilingual retrieval performance.
Similar resources
Textual Representations for Corpus-based Bilingual Retrieval
Paul McNamee, Doctor of Philosophy, 2008. Thesis directed by: Charles K. Nicholas, Professor, Department of Computer Science and Electrical Engineering. The traditional approach to information retrieval is based on using words as the indexing and search terms for documents. However, word-based representations have diffic...
Exploiting the Web as the multilingual corpus for unknown query translation
Users’ cross-lingual queries to a digital library system might be short and the query terms may not be included in a common translation dictionary (unknown terms). In this paper, we investigate the feasibility of exploiting the Web as the multilingual corpus source to translate unknown query terms for cross-language information retrieval in digital libraries. We propose a Web-based term transla...
Automatic extraction of bilingual word pairs using inductive chain learning in various languages
In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. In cross-language information retrieval, the system must deal with various languages. Therefore, automatic extraction of bilingual word pairs from parallel corpora with various languages is important. However, previous works based on statistical methods are insufficien...
Bilingual Terminology Acquisition from Comparable Corpora and Phrasal Translation to Cross-Language Information Retrieval
This paper presents an approach to bilingual lexicon extraction from non-aligned comparable corpora and to phrasal translation, together with evaluations on Cross-Language Information Retrieval. A two-stage translation model is proposed for the acquisition of bilingual terminology from comparable corpora, disambiguation and selection of best translation alternatives according to their...
Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment
Bilingual lexicons of proper names play a vital role in machine translation and cross-language information retrieval. Word alignment approaches are generally used to construct bilingual lexicons automatically from parallel corpora. Aligning proper names is a particularly difficult task when the source and target languages of the parallel corpus do not share the same written script. We present in ...